For this project, I utilize a portion of the data that I also use in my PhD thesis, which was obtained from my company Vytal. Vytal is a startup based in Cologne that provides a digital deposit-free reusable packaging system for take-out food and beverages. This system offers restaurants, canteens, supermarkets, and delivery services a convenient and sustainable packaging solution.
Vytal employs serialized containers that are distinguished by a unique QR code on their lids. The process unfolds as follows: customers are required to download an app and register in order to access the reusable containers free of charge. When a customer places an order using the reusable containers, both the QR code on the lid and the QR code from the customer’s app are scanned by the gastronomist. This allows for matching the container with the customer. Vytal operates on a pay-per-use business model. Whenever a container is issued from the gastronomist to a customer, Vytal charges the gastronomist a fee of €0.15 to €0.25, depending on the type of container.
Vytal offers 13 different types of containers, which are grouped into 3 classes. A restaurant has the flexibility to choose which classes of containers they will accept for return, but they must accept the classes from which they initially provided containers to their customers. Consequently, a customer can return their container to several stores within the network of Vytal-affiliated establishments. When a container is returned, only the QR code on the lid needs to be scanned in order to complete the return process. Subsequently, the container undergoes cleaning and is made available for the next customer.
Thanks to the serialization, Vytal is able to collect item-level transactional data. This implies that each time a container is issued or returned, Vytal captures information such as the previous owner and the new owner of the container, along with the container type and a timestamp.
For this project, I aggregate the data on a daily basis from three specific stores that I call Store 1, Store 2, and Store 3. The objective is to make predictions for both the containers returned by customers and those issued to customers for each store. Only one container type will be considered, which happens to be the top-selling type and accounts for approximately 85% of the revenue generated at these stores. The stores under consideration are canteens affiliated with one of the largest chemical companies in Germany. They operate from Monday to Friday, starting from 8 am and closing at 4 pm.
The objective is to generate accurate daily forecasts for a two-week period ahead. These forecasts hold significant importance for Vytal as they assist in replenishment and inventory rebalancing operations. As we will explore further, the supply (returns) and customer demand are well balanced, resulting in infrequent inventory replenishment requirements. However, there are occasions when Vytal needs to transfer containers from one store to another to address high inventory levels.
First, I load the realized demand and containers returned from each store. In the context of Vytal, a container that was issued to a customer is referred to as a “check-out” and a container that was returned by a customer is called a “check-in”. I will use their terminology in this project.
set.seed(187)
df_store1_checkouts <- read.csv("store1 checkouts.csv",sep=";")
str(df_store1_checkouts)
## 'data.frame': 684 obs. of 15 variables:
## $ date : chr "07.09.20" "08.09.20" "09.09.20" "10.09.20" ...
## $ checkouts : int 33 74 92 88 58 0 0 109 122 130 ...
## $ weekends : int 0 0 0 0 0 1 1 0 0 0 ...
## $ public_holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bridge_days : int 0 0 0 0 0 0 0 0 0 0 ...
## $ before_public_holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ after_public_holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_easter : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_summer : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_autumn : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_winter : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pre_school_days : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cov_law : int 0 0 0 0 0 0 0 0 0 0 ...
## $ christmas_season : int 0 0 0 0 0 0 0 0 0 0 ...
## $ public_holiday_week_after: int 0 0 0 0 0 0 0 0 0 0 ...
df_store2_checkouts <- read.csv("store2 checkouts.csv",sep=";")
str(df_store2_checkouts)
## 'data.frame': 677 obs. of 16 variables:
## $ date : chr "10.09.20" "11.09.20" "12.09.20" "13.09.20" ...
## $ checkouts : int 2 1 0 0 24 30 56 26 39 0 ...
## $ weekends : int 0 0 1 1 0 0 0 0 0 1 ...
## $ public_holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bridge_days : int 0 0 0 0 0 0 0 0 0 0 ...
## $ before_public_holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ after_public_holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_easter : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_summer : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_autumn : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_winter : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pre_school_days : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cov_law : int 0 0 0 0 0 0 0 0 0 0 ...
## $ christmas_season : int 0 0 0 0 0 0 0 0 0 0 ...
## $ public_holiday_week_after: int 0 0 0 0 0 0 0 0 0 0 ...
## $ tracking_wrong : int 0 0 0 0 0 0 0 0 0 0 ...
df_store3_checkouts <- read.csv("store3 checkouts.csv",sep=";")
str(df_store3_checkouts)
## 'data.frame': 677 obs. of 16 variables:
## $ date : chr "14.09.20" "15.09.20" "16.09.20" "17.09.20" ...
## $ checkouts : int 28 47 60 46 71 0 0 102 120 125 ...
## $ weekends : int 0 0 0 0 0 1 1 0 0 0 ...
## $ public_holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bridge_days : int 0 0 0 0 0 0 0 0 0 0 ...
## $ before_public_holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ after_public_holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_easter : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_summer : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_autumn : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_winter : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pre_school_days : int 0 0 0 0 0 0 0 1 1 1 ...
## $ cov_law : int 0 0 0 0 0 0 0 0 0 0 ...
## $ christmas_season : int 0 0 0 0 0 0 0 0 0 0 ...
## $ public_holiday_week_after: int 0 0 0 0 0 0 0 0 0 0 ...
## $ tracking_wrong : int 0 0 0 0 0 0 0 0 0 0 ...
For the check-outs, I have 684 observations (days) starting from 07.09.2020 for the first store. As for the second and third stores, I have 677 observations starting from 10.09.2020 and 14.09.2020, respectively.
df_store1_checkins <- read.csv("store1 checkins.csv",sep=";")
str(df_store1_checkins)
## 'data.frame': 683 obs. of 15 variables:
## $ date : chr "08.09.20" "09.09.20" "10.09.20" "11.09.20" ...
## $ checkins : int 30 46 86 64 0 0 73 97 118 120 ...
## $ weekends : int 0 0 0 0 1 1 0 0 0 0 ...
## $ public_holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bridge_days : int 0 0 0 0 0 0 0 0 0 0 ...
## $ before_public_holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ after_public_holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_easter : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_summer : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_autumn : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_winter : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pre_school_days : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cov_law : int 0 0 0 0 0 0 0 0 0 0 ...
## $ christmas_season : int 0 0 0 0 0 0 0 0 0 0 ...
## $ public_holiday_week_after: int 0 0 0 0 0 0 0 0 0 0 ...
df_store2_checkins <- read.csv("store2 checkins.csv",sep=";")
str(df_store2_checkins)
## 'data.frame': 677 obs. of 16 variables:
## $ date : chr "10.09.20" "11.09.20" "12.09.20" "13.09.20" ...
## $ checkins : int 2 1 0 0 0 15 30 42 27 0 ...
## $ weekends : int 0 0 1 1 0 0 0 0 0 1 ...
## $ public_holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bridge_days : int 0 0 0 0 0 0 0 0 0 0 ...
## $ before_public_holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ after_public_holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_easter : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_summer : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_autumn : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_winter : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pre_school_days : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cov_law : int 0 0 0 0 0 0 0 0 0 0 ...
## $ christmas_season : int 0 0 0 0 0 0 0 0 0 0 ...
## $ public_holiday_week_after: int 0 0 0 0 0 0 0 0 0 0 ...
## $ tracking_wrong : int 0 0 0 0 0 0 0 0 0 0 ...
df_store3_checkins <- read.csv("store3 checkins.csv",sep=";")
str(df_store3_checkins)
## 'data.frame': 677 obs. of 16 variables:
## $ date : chr "14.09.20" "15.09.20" "16.09.20" "17.09.20" ...
## $ checkins : int 3 17 38 55 38 0 0 64 72 104 ...
## $ weekends : int 0 0 0 0 0 1 1 0 0 0 ...
## $ public_holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bridge_days : int 0 0 0 0 0 0 0 0 0 0 ...
## $ before_public_holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ after_public_holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_easter : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_summer : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_autumn : int 0 0 0 0 0 0 0 0 0 0 ...
## $ school_holidays_winter : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pre_school_days : int 0 0 0 0 0 0 0 1 1 1 ...
## $ cov_law : int 0 0 0 0 0 0 0 0 0 0 ...
## $ christmas_season : int 0 0 0 0 0 0 0 0 0 0 ...
## $ public_holiday_week_after: int 0 0 0 0 0 0 0 0 0 0 ...
## $ tracking_wrong : int 0 0 0 0 0 0 0 0 0 0 ...
For the checkins, I have 683 observations (days) starting from 08.09.2020 for the first store. As for the second and third stores, I have 677 observations starting from 09.10.2020. The reason for this difference of 684 observations for checkouts and 683 for check-ins is that the check-outs took place first, and for Store 1, there were no checkins on the first day when Vytal was in operation.
I have already performed some pre-processing and feature engineering on the data, which will be explained below. However, further feature engineering will be conducted later in the project. Each dataset contains up to 16 columns with the following explanations:
| Variable | Explanation |
|---|---|
| date | The corresponding day when the data was collected. |
| checkins | The number of check-ins made on that day. |
| checkouts | The number of check-outs made on that day. |
| weekends | 1 if the day falls on a weekend, 0 otherwise. |
| public_holiday | 1 if the day is a public holiday, 0 otherwise. |
| bridge_days | 1 if the day is a possible bridging day, 0 otherwise. |
| before_public_holiday | 1 if the day is the day before a public holiday, 0 otherwise. |
| after_public_holiday | 1 if the day is the day after a public holiday, 0 otherwise. |
| school_holidays_easter | 1 if the day is during the Easter school holidays in Germany, 0 otherwise. |
| school_holidays_summer | 1 if the day is during the summer school holidays in Germany, 0 otherwise. |
| school_holidays_autumn | 1 if the day is during the autumn school holidays in Germany, 0 otherwise. |
| school_holidays_winter | 1 if the day is during the winter school holidays in Germany, 0 otherwise. |
| pre_school_days | 1 if the day falls within the 14-day period before school holidays. |
| cov_law | 1 if local COVID-19 regulations were in effect, affecting the business in terms of selling restrictions. |
| christmas_season | 1 if the day is close to Christmas or New Year’s Eve, 0 otherwise. |
| public_holiday_week_after | Used as a sanity check, as discussed by Huber & Stuckenschmidt (2020). |
| tracking_wrong | Relevant only for store 2 and 3, indicating a problem with scanning devices between the two stores. |
First of all, lets plot the check-outs and check-ins for all stores. We have to convert the date column to a date datatype and index.
I will use the most recent 6 weeks (30 days) of each data set as the test data, and the remaining data will be used as the training data set. The exploratory data analysis will be done with the training set
# I transform the string column to an appropriate date column
df_store1_checkouts$date <- as.Date(df_store1_checkouts$date, format="%d.%m.%y")
df_store2_checkouts$date <- as.Date(df_store2_checkouts$date, format="%d.%m.%y")
df_store3_checkouts$date <- as.Date(df_store3_checkouts$date, format="%d.%m.%y")
df_store1_checkins$date <- as.Date(df_store1_checkins$date, format="%d.%m.%y")
df_store2_checkins$date <- as.Date(df_store2_checkins$date, format="%d.%m.%y")
df_store3_checkins$date <- as.Date(df_store3_checkins$date, format="%d.%m.%y")
df_store1_checkouts <- df_store1_checkouts[df_store1_checkouts$weekends == 0, ]
df_store2_checkouts <- df_store2_checkouts[df_store2_checkouts$weekends == 0, ]
df_store3_checkouts <- df_store3_checkouts[df_store3_checkouts$weekends == 0, ]
df_store1_checkins <- df_store1_checkins[df_store1_checkins$weekends == 0, ]
df_store2_checkins <- df_store2_checkins[df_store2_checkins$weekends == 0, ]
df_store3_checkins <- df_store3_checkins[df_store3_checkins$weekends == 0, ]
# Split of training and test data
df_store1_checkouts_test <- tail(df_store1_checkouts,10)
df_store1_checkouts_training <- head(df_store1_checkouts,-10)
df_store2_checkouts_test <- tail(df_store2_checkouts,10)
df_store2_checkouts_training <- head(df_store2_checkouts,-10)
df_store3_checkouts_test <- tail(df_store3_checkouts,10)
df_store3_checkouts_training <- head(df_store3_checkouts,-10)
df_store1_checkins_test <- tail(df_store1_checkins,10)
df_store1_checkins_training <- head(df_store1_checkins,-10)
df_store2_checkins_test <- tail(df_store2_checkins,10)
df_store2_checkins_training <- head(df_store2_checkins,-10)
df_store3_checkins_test <- tail(df_store3_checkins,10)
df_store3_checkins_training <- head(df_store3_checkins,-10)
# Merged dataframe makes it easier to plot the data
df_merged_store_1 <- merge(df_store1_checkouts_training[, c("date", "checkouts")],df_store1_checkins_training[, c("date", "checkins")], by = "date", all = TRUE)
# Fill the missing value for checkins with 0
df_merged_store_1$checkins[is.na(df_merged_store_1$checkins)] <- 0
df_merged_store_2 <- merge(df_store2_checkouts_training[, c("date", "checkouts")],df_store2_checkins_training[, c("date", "checkins")], by = "date", all = TRUE)
df_merged_store_3 <- merge(df_store3_checkouts_training[, c("date", "checkouts")],df_store3_checkins_training[, c("date", "checkins")], by = "date", all = TRUE)
Based on these plots, we observe several patterns. Firstly, it appears that the daily checkouts and checkins are fairly balanced. Secondly, there is a significant decrease in the number of both checkins and checkouts around Christmas and New Year’s Eve. Thirdly, during the summer of 2021, we observe a sharp decrease in checkouts and checkins for store 2, while store 3 shows very high levels. To account for this event, we introduced a dummy variable as mentioned above. It is worth noting that the scanning devices from store 3 were used in both store 2 and store 3, as they are located on the same campus. Lastly, we notice that all stores are closed on specific days, most likely weekends, which is further supported by examining the percentage of checkouts and checkins made during weekends.
mask <- df_store1_checkouts_training$weekends == 1
(sum(df_store1_checkouts_training$checkouts[mask]) / sum(df_store1_checkouts_training$checkouts))*100
## [1] 0
mask <- df_store1_checkins_training$weekends == 1
(sum(df_store1_checkins_training$checkins[mask]) / sum(df_store1_checkins_training$checkins))*100
## [1] 0
mask <- df_store2_checkouts_training$weekends == 1
(sum(df_store2_checkouts_training$checkouts[mask]) / sum(df_store2_checkouts_training$checkouts))*100
## [1] 0
mask <- df_store2_checkins_training$weekends == 1
(sum(df_store2_checkins_training$checkins[mask]) / sum(df_store2_checkins_training$checkins))*100
## [1] 0
mask <- df_store3_checkouts_training$weekends == 1
(sum(df_store3_checkouts_training$checkouts[mask]) / sum(df_store3_checkouts_training$checkouts))*100
## [1] 0
mask <- df_store3_checkins_training$weekends == 1
(sum(df_store3_checkins_training$checkins[mask]) / sum(df_store3_checkins_training$checkins))*100
## [1] 0
Based on the data, one can observe that there are only check-outs and check-ins for store 1 that occurred during the weekend. However, the number of such transactions is less than 1% of the total check-outs and check-ins, respectively. There could be various reasons for these weekend transactions, such as special opening hours, scans made during training sessions for staff, inventory checks, or other factors. However, it is important to note that this is not a regular pattern and can also be confirmed based on the accessible opening hours from the firm’s master data database, and therefore I exclude the weekends for all three stores in my analysis.
Next, we want to analyze if there are weekly patterns.
day_order <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
width = 0.5
# Store 1
ggplot(df_store1_checkouts_training, aes(x = dayoftheweek, y = checkouts, fill = "Store 1")) +
geom_boxplot(position = position_dodge(width = width), fill = orange) +
labs(x = "Day of the Week", y = "Check-outs", title = "Check-outs per Day Store 1") +
scale_x_discrete(limits = day_order) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
ggplot(df_store1_checkins_training, aes(x = dayoftheweek, y = checkins, fill = "Store 1")) +
geom_boxplot(position = position_dodge(width = width), fill = blue) +
labs(x = "Day of the Week", y = "Check-ins", title = "Check-ins per Day Store 1") +
scale_x_discrete(limits = day_order) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
For Store 1, we observed only one outlier above the median for check-ins and check-outs. However, there are several outliers where the check-ins or check-outs are zero. These occurrences correspond to public holidays. Additionally, there are outliers below the median that are associated with the period around Christmas and New Year’s Eve. Furthermore, it is noticeable that Friday is the weakest day of the week in terms of transactions. This finding is logical as people may be working from home, have the day off, or prefer not to borrow a container over the weekend, among other reasons.
For Store 2, it is also evident that there are three outliers above the median, two for check-outs, and one for check-ins. Similarly, noticeable outliers are associated with public holidays and the Christmas period. Moreover, outliers can be attributed to the period when the scanning devices from Store 3 were used in Store 2. Additionally, Friday continues to be the least busy day of the week for this store as well.
For Store 3, the pattern observed is slightly different from the other stores. There are several outliers above the median, which can be attributed to scanning issues at Store 2. Additionally, outliers are present during public holidays when the store is closed and during the Christmas period when only a few customers visit the store. Similar to the other stores, Fridays remain the weakest day of the week for Store 3.
The boxplots provided above highlight a consistent weekly pattern for all three stores, with a noticeable decrease in volume on Fridays and a gradual increase throughout the week. This pattern seems to be significant and can be captured using feature engineering techniques. One approach could be developing a feature using radial basis functions to represent this weekly pattern accurately.
The next question that arises is whether there might be multiple seasonal patterns in the data. In addition to the weekly pattern, it is important to analyze if there are any monthly or annual seasonality patterns. However, due to the limitations of our dataset, it is challenging to analyze annual patterns comprehensively. Nevertheless, it is still worthwhile to explore possible monthly patterns in the data.
To assess the presence of multiple seasonal patterns, I will employ the multiple seasonal decomposition algorithm proposed by Bandara et al. (2021), which is implemented in the forecast package. This decomposition approach can help identify and analyze any potential monthly patterns in the data.
# For illustrational purposes I hide the x-axis labels
x <- ts(df_store1_checkouts_training$checkouts, start = min(df_store1_checkouts_training$date), frequency=5)
x <- msts(x, seasonal.periods = c(5, 365.25*5/7/12))
mstl(x) %>% autoplot() + labs(title="Store 1 Check-outs") + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(),plot.title = element_text(hjust = 0.5))
x <- ts(df_store1_checkins_training$checkins, start = min(df_store1_checkins_training$date), frequency=5)
x <- msts(x, seasonal.periods = c(5, 365.25*5/7/12))
mstl(x) %>% autoplot() + labs(title="Store 1 Check-ins") + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(),plot.title = element_text(hjust = 0.5))
x <- ts(df_store2_checkouts_training$checkouts, start = min(df_store2_checkouts_training$date), frequency=5)
x <- msts(x, seasonal.periods = c(5, 365.25*5/7/12))
mstl(x) %>% autoplot() + labs(title="Store 2 Check-outs") + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), plot.title = element_text(hjust = 0.5))
x <- ts(df_store2_checkins_training$checkins, start = min(df_store2_checkins_training$date), frequency=5)
x <- msts(x, seasonal.periods = c(5, 365.25*5/7/12))
mstl(x) %>% autoplot() + labs(title="Store 2 Check-ins") + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), plot.title = element_text(hjust = 0.5))
x <- ts(df_store3_checkouts_training$checkouts, start = min(df_store3_checkouts_training$date), frequency=5)
x <- msts(x, seasonal.periods = c(5, 365.25*5/7/12))
mstl(x) %>% autoplot() + labs(title="Store 3 Check-outs") + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), plot.title = element_text(hjust = 0.5))
x <- ts(df_store3_checkins_training$checkins, start = min(df_store3_checkins_training$date), frequency=5)
x <- msts(x, seasonal.periods = c(5, 365.25*5/7/12))
mstl(x) %>% autoplot() + labs(title="Store 3 Check-ins") + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), plot.title = element_text(hjust = 0.5))
The figures above depict the seasonal decomposition for all six time series. In these graphs, we can once again observe a weekly pattern as well as a monthly pattern. These patterns will be significant during the feature engineering phase. Furthermore, the data exhibits an initial increasing trend followed by a decreasing trend. Additionally, noticeable outliers can be observed during public holidays. In conclusion, we are faced with the issue of multi-seasonality, which we will address in the feature engineering stage.
In order to incorporate both weekly and monthly patterns in our analysis, we can utilize radial basis functions (RBFs) (see, e.g., Sheta & De Jong, 2001). One of the commonly used RBFs is the Gaussian kernel, defined as follows:
\[ \phi_j(x_{i}) = \exp\bigg[-\frac {(x_{i} - c_j)^2}{2 h_j^2}\bigg], \] where \(x_i\) is the day of the week or day of the month, \(c_j\) are the centers of the radial basis functions, and \(h_j^2\) width of the kernel j.
For the weekly pattern, we require five kernels, one for each business day of the week. The center \(c_j\) for each kernel is set to the respective day of the week (e.g., \(c_j = j\) for Monday through Friday).
For the monthly pattern, we can have up to 31 kernels, considering the maximum number of days in a month. However, it’s important to note that not all months have 31 days obviously. Similarly, the center \(c_j\) for each kernel is set to the respective day of the month.
By using these RBFs, we can capture both weekly and monthly patterns in our analysis effectively. Note that there are optimization algorithms to find the “optimal” parameters for the RBF. However, here I keep it simple and use a width of based on some simple grid search that I did.
# Compute day of the week & month
df_store1_checkouts_training$dayoftheweek <- as.factor(match(df_store1_checkouts_training$dayoftheweek, weekdays(df_store1_checkouts_training$date)))
df_store2_checkouts_training$dayoftheweek <- as.factor(match(df_store2_checkouts_training$dayoftheweek, weekdays(df_store2_checkouts_training$date)))
df_store3_checkouts_training$dayoftheweek <- as.factor(match(df_store3_checkouts_training$dayoftheweek, weekdays(df_store3_checkouts_training$date)))
df_store1_checkins_training$dayoftheweek <- as.factor(match(df_store1_checkins_training$dayoftheweek, weekdays(df_store1_checkins_training$date)))
df_store2_checkins_training$dayoftheweek <- as.factor(match(df_store2_checkins_training$dayoftheweek, weekdays(df_store2_checkins_training$date)))
df_store3_checkins_training$dayoftheweek <- as.factor(match(df_store3_checkins_training$dayoftheweek, weekdays(df_store3_checkins_training$date)))
df_store1_checkouts_training$dayofthemonth <- as.factor(day(df_store1_checkouts_training$date))
df_store2_checkouts_training$dayofthemonth <- as.factor(day(df_store2_checkouts_training$date))
df_store3_checkouts_training$dayofthemonth <- as.factor(day(df_store3_checkouts_training$date))
df_store1_checkins_training$dayofthemonth <- as.factor(day(df_store1_checkins_training$date))
df_store2_checkins_training$dayofthemonth <- as.factor(day(df_store2_checkins_training$date))
df_store3_checkins_training$dayofthemonth <- as.factor(day(df_store3_checkins_training$date))
df_store1_checkouts_training$month <- month(df_store1_checkouts_training$date)
df_store2_checkouts_training$month <- month(df_store2_checkouts_training$date)
df_store3_checkouts_training$month <- month(df_store3_checkouts_training$date)
df_store1_checkins_training$month <- month(df_store1_checkins_training$date)
df_store2_checkins_training$month <- month(df_store2_checkins_training$date)
df_store3_checkins_training$month <- month(df_store3_checkins_training$date)
# Radial basic function
width <- 1 # Width parameter
# Compute repeating basis functions
compute_rbf <- function(df_data, day_series, centers, h, saison = "weekly") {
n <- length(day_series)
x = 0
if (saison == "weekly"){
k <- 5
}
else if(saison == "monthly") {
k <- 31
}
df_with_rbf <- data.frame(df_data)
Phi <- matrix(0, n, k)
for (j in 1:k) {
for (i in 1:n) {
if (saison == "weekly")
{
Phi[i, j] <- exp(-(as.integer(df_with_rbf$dayoftheweek[i]) - centers[j])^2 / (2 * h^2))
}
else if (saison == "monthly")
{
# df_with_rbf$month = index of the centers
Phi[i, j] <- exp(-(as.integer(df_with_rbf$dayofthemonth[i]) - centers[j])^2 / (2 * h^2))
}
}
}
if (saison == "weekly") { colnames(Phi) <- paste0("DayOfTheWeek", seq(1, 5)) }
else if (saison == "monthly")
{
colnames(Phi) <- paste0("DayOfTheMonth", seq(1, 31))
}
df_with_rbf <- cbind(df_with_rbf, Phi)
if (saison == "weekly")
{
df_with_rbf <- df_with_rbf[, -which(names(df_with_rbf) == "dayoftheweek")]
df_with_rbf <- df_with_rbf[, -which(names(df_with_rbf) == "month")]
}
else if (saison == "monthly")
{
df_with_rbf <- df_with_rbf[, -which(names(df_with_rbf) == "dayofthemonth")]
df_with_rbf <- df_with_rbf[, -which(names(df_with_rbf) == "month")]
}
return(data_frame(df_with_rbf))
}
df_weekly <- compute_rbf(df_store1_checkouts_training,df_store1_checkouts_training$dayoftheweek,c(1,2,3,4,5), width, "weekly")
df_weekly <- df_weekly[, -which(names(df_weekly) == "DayOfTheWeek1")]
model <- lm(checkouts ~ DayOfTheWeek2 + DayOfTheWeek3 + DayOfTheWeek4 + DayOfTheWeek5, data = df_weekly)
predictions <- predict(model, newdata = df_weekly)
summary(model)
##
## Call:
## lm(formula = checkouts ~ DayOfTheWeek2 + DayOfTheWeek3 + DayOfTheWeek4 +
## DayOfTheWeek5, data = df_weekly)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.96 -26.88 10.18 38.16 127.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.7739 19.6327 7.985 1.07e-14 ***
## DayOfTheWeek2 37.7202 30.7872 1.225 0.22111
## DayOfTheWeek3 -0.9255 25.2531 -0.037 0.97078
## DayOfTheWeek4 40.1956 31.1696 1.290 0.19783
## DayOfTheWeek5 -45.7393 17.4468 -2.622 0.00903 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.7 on 475 degrees of freedom
## Multiple R-squared: 0.1323, Adjusted R-squared: 0.125
## F-statistic: 18.1 on 4 and 475 DF, p-value: 7.543e-14
plot_data <- data.frame( date = df_weekly$date, actual = df_weekly$checkouts, predicted = predictions )
plot(actual ~ date, data = plot_data, type = "l", col = "blue", xlab =
"Date", ylab = "Checkout", main = "Actual vs Predicted Checkout")
lines(predicted ~ date, data = plot_data, col = "red")
legend("topright", legend = c("Actual", "Predicted"), col = c("blue",
"red"), lty = 1)
We observe that the radial basis function effectively models the weekly pattern in the data. It’s important to note that at this stage, we are assessing the in-sample accuracy, which evaluates how well the model fits the observed data. Later on, we will make out-of-sample predictions to evaluate the predictive performance of different algorithms.
df_monthly <- compute_rbf(df_store1_checkouts_training,df_store1_checkouts_training$dayofthemonth,c(seq(1,31)), width, "monthly")
df_merged <- merge(df_monthly, df_weekly[c("date", "DayOfTheWeek2","DayOfTheWeek3","DayOfTheWeek4","DayOfTheWeek5")], by = "date")
df_merged <- df_merged[, -which(names(df_merged) == "DayOfTheMonth1")]
df_merged <- df_merged[, -which(names(df_merged) == "dayoftheweek")]
model <- lm(checkouts ~ DayOfTheWeek2 + DayOfTheWeek3 + DayOfTheWeek4 + DayOfTheWeek5 +
DayOfTheMonth2 + DayOfTheMonth3 + DayOfTheMonth4 + DayOfTheMonth5 +
DayOfTheMonth6 + DayOfTheMonth7 + DayOfTheMonth8 + DayOfTheMonth9 +
DayOfTheMonth10 + DayOfTheMonth11 + DayOfTheMonth12 + DayOfTheMonth13 +
DayOfTheMonth14 + DayOfTheMonth15 + DayOfTheMonth16 + DayOfTheMonth17 +
DayOfTheMonth18 + DayOfTheMonth19 + DayOfTheMonth20 + DayOfTheMonth21 +
DayOfTheMonth22 + DayOfTheMonth23 + DayOfTheMonth24 + DayOfTheMonth25 +
DayOfTheMonth26 + DayOfTheMonth27 + DayOfTheMonth28 + DayOfTheMonth29 +
DayOfTheMonth30 + DayOfTheMonth31,
data = df_merged)
summary(model)
##
## Call:
## lm(formula = checkouts ~ DayOfTheWeek2 + DayOfTheWeek3 + DayOfTheWeek4 +
## DayOfTheWeek5 + DayOfTheMonth2 + DayOfTheMonth3 + DayOfTheMonth4 +
## DayOfTheMonth5 + DayOfTheMonth6 + DayOfTheMonth7 + DayOfTheMonth8 +
## DayOfTheMonth9 + DayOfTheMonth10 + DayOfTheMonth11 + DayOfTheMonth12 +
## DayOfTheMonth13 + DayOfTheMonth14 + DayOfTheMonth15 + DayOfTheMonth16 +
## DayOfTheMonth17 + DayOfTheMonth18 + DayOfTheMonth19 + DayOfTheMonth20 +
## DayOfTheMonth21 + DayOfTheMonth22 + DayOfTheMonth23 + DayOfTheMonth24 +
## DayOfTheMonth25 + DayOfTheMonth26 + DayOfTheMonth27 + DayOfTheMonth28 +
## DayOfTheMonth29 + DayOfTheMonth30 + DayOfTheMonth31, data = df_merged)
##
## Residuals:
## Min 1Q Median 3Q Max
## -185.347 -26.685 9.026 37.392 122.701
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 135.7783 58.8006 2.309 0.0214 *
## DayOfTheWeek2 40.4677 31.7770 1.273 0.2035
## DayOfTheWeek3 -3.0189 26.0513 -0.116 0.9078
## DayOfTheWeek4 43.2803 32.1617 1.346 0.1791
## DayOfTheWeek5 -46.4318 17.9940 -2.580 0.0102 *
## DayOfTheMonth2 15.1259 92.6185 0.163 0.8703
## DayOfTheMonth3 0.1486 90.3240 0.002 0.9987
## DayOfTheMonth4 34.4400 139.6221 0.247 0.8053
## DayOfTheMonth5 5.7996 124.2540 0.047 0.9628
## DayOfTheMonth6 -18.5895 146.8068 -0.127 0.8993
## DayOfTheMonth7 44.3800 131.6628 0.337 0.7362
## DayOfTheMonth8 -33.6954 141.4736 -0.238 0.8119
## DayOfTheMonth9 49.7257 135.3926 0.367 0.7136
## DayOfTheMonth10 -16.2846 142.7859 -0.114 0.9093
## DayOfTheMonth11 7.4756 141.9172 0.053 0.9580
## DayOfTheMonth12 48.5486 146.7253 0.331 0.7409
## DayOfTheMonth13 -52.8326 145.0847 -0.364 0.7159
## DayOfTheMonth14 78.2187 143.2977 0.546 0.5854
## DayOfTheMonth15 -79.8894 140.1491 -0.570 0.5689
## DayOfTheMonth16 101.4602 140.2557 0.723 0.4698
## DayOfTheMonth17 -59.4977 141.3210 -0.421 0.6740
## DayOfTheMonth18 22.4547 143.6343 0.156 0.8758
## DayOfTheMonth19 27.6922 145.8836 0.190 0.8495
## DayOfTheMonth20 -10.0298 145.7322 -0.069 0.9452
## DayOfTheMonth21 28.4262 142.8948 0.199 0.8424
## DayOfTheMonth22 -29.2081 140.3137 -0.208 0.8352
## DayOfTheMonth23 68.1720 140.0315 0.487 0.6266
## DayOfTheMonth24 -76.8841 140.9209 -0.546 0.5856
## DayOfTheMonth25 91.3442 143.0462 0.639 0.5234
## DayOfTheMonth26 -72.4164 143.6838 -0.504 0.6145
## DayOfTheMonth27 96.2107 142.5846 0.675 0.5002
## DayOfTheMonth28 -73.8100 133.2464 -0.554 0.5799
## DayOfTheMonth29 57.0557 124.7772 0.457 0.6477
## DayOfTheMonth30 -19.7485 96.8494 -0.204 0.8385
## DayOfTheMonth31 8.1361 75.5011 0.108 0.9142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.1 on 445 degrees of freedom
## Multiple R-squared: 0.1478, Adjusted R-squared: 0.08268
## F-statistic: 2.27 on 34 and 445 DF, p-value: 9.456e-05
predictions <- predict(model, newdata = df_merged)
plot_data <- data.frame( date = df_merged$date, actual = df_merged$checkouts, predicted = predictions )
plot(actual ~ date, data = plot_data, type = "l", col = "blue", xlab =
"Date", ylab = "Checkout", main = "Actual vs Predicted Checkout")
lines(predicted ~ date, data = plot_data, col = "red")
legend("topright", legend = c("Actual", "Predicted"), col = c("blue",
"red"), lty = 1)
It is important to note that in this OLS model, none of the days are statistically significant. However, this does not necessarily mean that these features lack predictive power. The model has become more flexible in terms of capturing seasonality patterns.
Moving forward, I will include additional variables related to holidays and public holidays. These variables will provide more information about the impact of specific holidays on the check-out volume, allowing one to account for any seasonality associated with those events.
model <- lm(checkouts ~ DayOfTheWeek2 + DayOfTheWeek3 + DayOfTheWeek4 + DayOfTheWeek5 +
DayOfTheMonth2 + DayOfTheMonth3 + DayOfTheMonth4 + DayOfTheMonth5 +
DayOfTheMonth6 + DayOfTheMonth7 + DayOfTheMonth8 + DayOfTheMonth9 +
DayOfTheMonth10 + DayOfTheMonth11 + DayOfTheMonth12 + DayOfTheMonth13 +
DayOfTheMonth14 + DayOfTheMonth15 + DayOfTheMonth16 + DayOfTheMonth17 +
DayOfTheMonth18 + DayOfTheMonth19 + DayOfTheMonth20 + DayOfTheMonth21 +
DayOfTheMonth22 + DayOfTheMonth23 + DayOfTheMonth24 + DayOfTheMonth25 +
DayOfTheMonth26 + DayOfTheMonth27 + DayOfTheMonth28 + DayOfTheMonth29 +
DayOfTheMonth30 + DayOfTheMonth31 + public_holiday + bridge_days + before_public_holiday + after_public_holiday + christmas_season,
data = df_merged)
summary(model)
##
## Call:
## lm(formula = checkouts ~ DayOfTheWeek2 + DayOfTheWeek3 + DayOfTheWeek4 +
## DayOfTheWeek5 + DayOfTheMonth2 + DayOfTheMonth3 + DayOfTheMonth4 +
## DayOfTheMonth5 + DayOfTheMonth6 + DayOfTheMonth7 + DayOfTheMonth8 +
## DayOfTheMonth9 + DayOfTheMonth10 + DayOfTheMonth11 + DayOfTheMonth12 +
## DayOfTheMonth13 + DayOfTheMonth14 + DayOfTheMonth15 + DayOfTheMonth16 +
## DayOfTheMonth17 + DayOfTheMonth18 + DayOfTheMonth19 + DayOfTheMonth20 +
## DayOfTheMonth21 + DayOfTheMonth22 + DayOfTheMonth23 + DayOfTheMonth24 +
## DayOfTheMonth25 + DayOfTheMonth26 + DayOfTheMonth27 + DayOfTheMonth28 +
## DayOfTheMonth29 + DayOfTheMonth30 + DayOfTheMonth31 + public_holiday +
## bridge_days + before_public_holiday + after_public_holiday +
## christmas_season, data = df_merged)
##
## Residuals:
## Min 1Q Median 3Q Max
## -150.255 -29.686 2.363 28.265 121.902
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 190.353 44.976 4.232 2.82e-05 ***
## DayOfTheWeek2 26.583 24.308 1.094 0.274722
## DayOfTheWeek3 -8.031 19.997 -0.402 0.688182
## DayOfTheWeek4 39.836 24.588 1.620 0.105924
## DayOfTheWeek5 -44.680 13.832 -3.230 0.001330 **
## DayOfTheMonth2 -17.072 70.628 -0.242 0.809107
## DayOfTheMonth3 10.484 68.899 0.152 0.879132
## DayOfTheMonth4 3.104 106.675 0.029 0.976800
## DayOfTheMonth5 -6.205 94.684 -0.066 0.947782
## DayOfTheMonth6 -17.781 111.785 -0.159 0.873690
## DayOfTheMonth7 10.979 100.116 0.110 0.912730
## DayOfTheMonth8 -41.707 107.575 -0.388 0.698423
## DayOfTheMonth9 27.350 102.965 0.266 0.790657
## DayOfTheMonth10 -31.978 108.629 -0.294 0.768612
## DayOfTheMonth11 -9.813 108.044 -0.091 0.927672
## DayOfTheMonth12 21.751 111.861 0.194 0.845916
## DayOfTheMonth13 -49.782 110.607 -0.450 0.652871
## DayOfTheMonth14 50.143 109.274 0.459 0.646548
## DayOfTheMonth15 -82.664 106.577 -0.776 0.438388
## DayOfTheMonth16 86.433 106.655 0.810 0.418148
## DayOfTheMonth17 -82.143 107.801 -0.762 0.446474
## DayOfTheMonth18 20.151 109.490 0.184 0.854065
## DayOfTheMonth19 1.346 111.422 0.012 0.990366
## DayOfTheMonth20 -25.633 111.199 -0.231 0.817801
## DayOfTheMonth21 6.248 109.325 0.057 0.954450
## DayOfTheMonth22 -13.400 107.497 -0.125 0.900855
## DayOfTheMonth23 -11.821 107.757 -0.110 0.912701
## DayOfTheMonth24 3.700 108.339 0.034 0.972774
## DayOfTheMonth25 28.568 109.902 0.260 0.795028
## DayOfTheMonth26 -76.549 109.914 -0.696 0.486517
## DayOfTheMonth27 103.913 109.099 0.952 0.341384
## DayOfTheMonth28 -124.141 101.502 -1.223 0.221971
## DayOfTheMonth29 100.179 95.016 1.054 0.292305
## DayOfTheMonth30 -103.811 73.898 -1.405 0.160791
## DayOfTheMonth31 70.882 57.656 1.229 0.219581
## public_holiday -154.297 12.801 -12.053 < 2e-16 ***
## bridge_days -93.046 28.944 -3.215 0.001402 **
## before_public_holiday -45.509 13.671 -3.329 0.000946 ***
## after_public_holiday -22.445 16.282 -1.378 0.168759
## christmas_season -90.123 13.455 -6.698 6.47e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.69 on 440 degrees of freedom
## Multiple R-squared: 0.5131, Adjusted R-squared: 0.4699
## F-statistic: 11.89 on 39 and 440 DF, p-value: < 2.2e-16
predictions <- predict(model, newdata = df_merged)
plot_data <- data.frame( date = df_merged$date, actual = df_merged$checkouts, predicted = predictions )
plot(actual ~ date, data = plot_data, type = "l", col = "blue", xlab =
"Date", ylab = "Checkout", main = "Actual vs Predicted Checkout")
lines(predicted ~ date, data = plot_data, col = "red")
legend("topright", legend = c("Actual", "Predicted"), col = c("blue",
"red"), lty = 1)
As expected, holidays have a significant impact on the check-outs and check-ins, as their values are consistently zero during those periods, making them perfectly predictable. However, the current model still faces challenges in capturing certain aspects of the time series. One factor that has not been utilized is the autoregressive properties of the time series, which I excluded here.
# We do not need this feature
df_merged <- df_merged %>% select(-weekends)
model <- lm(checkouts ~. -date,
data = df_merged)
summary(model)
##
## Call:
## lm(formula = checkouts ~ . - date, data = df_merged)
##
## Residuals:
## Min 1Q Median 3Q Max
## -160.612 -28.003 2.362 27.724 114.161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 199.3839 44.4017 4.490 9.13e-06 ***
## public_holiday -149.2668 13.2163 -11.294 < 2e-16 ***
## bridge_days -98.2315 29.0483 -3.382 0.000786 ***
## before_public_holiday -44.1944 13.7309 -3.219 0.001385 **
## after_public_holiday -15.5769 16.8250 -0.926 0.355055
## school_holidays_easter -19.7465 12.2929 -1.606 0.108929
## school_holidays_summer -13.3441 6.6605 -2.003 0.045749 *
## school_holidays_autumn -30.7195 11.0277 -2.786 0.005576 **
## school_holidays_winter -19.6642 17.5774 -1.119 0.263879
## pre_school_days -16.4140 5.9166 -2.774 0.005772 **
## cov_law -12.2409 4.8489 -2.524 0.011944 *
## christmas_season -74.9670 21.7261 -3.451 0.000614 ***
## public_holiday_week_after 1.8072 12.9344 0.140 0.888946
## DayOfTheMonth2 -15.4569 69.6822 -0.222 0.824559
## DayOfTheMonth3 4.2008 67.9419 0.062 0.950727
## DayOfTheMonth4 5.4957 105.1338 0.052 0.958335
## DayOfTheMonth5 -1.0916 93.3449 -0.012 0.990675
## DayOfTheMonth6 -19.8301 110.1862 -0.180 0.857261
## DayOfTheMonth7 13.8084 98.7146 0.140 0.888819
## DayOfTheMonth8 -44.3215 106.0555 -0.418 0.676221
## DayOfTheMonth9 31.6028 101.5563 0.311 0.755810
## DayOfTheMonth10 -39.1723 107.2157 -0.365 0.715021
## DayOfTheMonth11 -5.3594 106.5838 -0.050 0.959920
## DayOfTheMonth12 21.4260 110.3566 0.194 0.846148
## DayOfTheMonth13 -47.1595 109.0258 -0.433 0.665555
## DayOfTheMonth14 50.6362 107.6972 0.470 0.638469
## DayOfTheMonth15 -81.4992 105.0382 -0.776 0.438231
## DayOfTheMonth16 89.3075 105.1208 0.850 0.396034
## DayOfTheMonth17 -87.6948 106.2464 -0.825 0.409604
## DayOfTheMonth18 23.0020 107.9516 0.213 0.831367
## DayOfTheMonth19 0.4881 109.9083 0.004 0.996458
## DayOfTheMonth20 -21.1304 109.7733 -0.192 0.847448
## DayOfTheMonth21 0.4513 107.8535 0.004 0.996663
## DayOfTheMonth22 -4.2962 106.0380 -0.041 0.967701
## DayOfTheMonth23 -16.7748 106.3143 -0.158 0.874700
## DayOfTheMonth24 5.5754 106.8717 0.052 0.958418
## DayOfTheMonth25 22.5633 108.5248 0.208 0.835398
## DayOfTheMonth26 -72.3514 108.4098 -0.667 0.504880
## DayOfTheMonth27 98.7824 107.6266 0.918 0.359221
## DayOfTheMonth28 -123.7823 100.0705 -1.237 0.216776
## DayOfTheMonth29 100.9560 93.7442 1.077 0.282111
## DayOfTheMonth30 -102.0405 72.9677 -1.398 0.162699
## DayOfTheMonth31 68.9736 57.0810 1.208 0.227574
## DayOfTheWeek2 26.5309 23.9661 1.107 0.268901
## DayOfTheWeek3 -6.9590 19.7734 -0.352 0.725056
## DayOfTheWeek4 39.5647 24.2639 1.631 0.103702
## DayOfTheWeek5 -44.0456 13.6788 -3.220 0.001379 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.02 on 433 degrees of freedom
## Multiple R-squared: 0.5347, Adjusted R-squared: 0.4852
## F-statistic: 10.82 on 46 and 433 DF, p-value: < 2.2e-16
predictions <- predict(model, newdata = df_merged)
plot_data <- data.frame( date = df_merged$date, actual = df_merged$checkouts, predicted = predictions )
plot(actual ~ date, data = plot_data, type = "l", col = "blue", xlab =
"Date", ylab = "Checkout", main = "Actual vs Predicted Checkout")
lines(predicted ~ date, data = plot_data, col = "red")
legend("topright", legend = c("Actual", "Predicted"), col = c("blue",
"red"), lty = 1)
After incorporating all the aforementioned features into the OLS model, one can observe an improvement in the model’s fit. Notably, the variable “pre_school_holidays,” which represents the period 2 weeks or 10 business days before the start of school holidays, has a significant negative effect on check-outs at a 5% confidence level. This negative effect can potentially be attributed to the 14-day loan period, where individuals anticipate going on holidays and realize they may not be able to return the containers on time, leading to a decrease in usage.
Moving forward, I will proceed with analyzing the out-of-sample forecast accuracy for all three time series using different algorithms, namely OLS, seasonal ARIMA, and random forest.
First of all, I prepare the test data set.
df_weekly <- compute_rbf(df_store1_checkouts_training,df_store1_checkouts_training$dayoftheweek,c(1,2,3,4,5), width, "weekly")
df_weekly <- df_weekly[, -which(names(df_weekly) == "DayOfTheWeek1")]
df_monthly <- compute_rbf(df_store1_checkouts_training,df_store1_checkouts_training$dayofthemonth,c(seq(1,31)), width, "monthly")
df_merged <- merge(df_monthly, df_weekly[c("date", "DayOfTheWeek2","DayOfTheWeek3","DayOfTheWeek4","DayOfTheWeek5")], by = "date")
df_merged <- df_merged[, -which(names(df_merged) == "DayOfTheMonth1")]
df_store1_checkouts_training <- df_merged[, -which(names(df_merged) == "dayoftheweek")]
df_weekly <- compute_rbf(df_store2_checkouts_training,df_store2_checkouts_training$dayoftheweek,c(1,2,3,4,5), width, "weekly")
df_weekly <- df_weekly[, -which(names(df_weekly) == "DayOfTheWeek1")]
df_monthly <- compute_rbf(df_store2_checkouts_training,df_store2_checkouts_training$dayofthemonth,c(seq(1,31)), width, "monthly")
df_merged <- merge(df_monthly, df_weekly[c("date", "DayOfTheWeek2","DayOfTheWeek3","DayOfTheWeek4","DayOfTheWeek5")], by = "date")
df_merged <- df_merged[, -which(names(df_merged) == "DayOfTheMonth1")]
df_store2_checkouts_training <- df_merged[, -which(names(df_merged) == "dayoftheweek")]
df_weekly <- compute_rbf(df_store3_checkouts_training,df_store3_checkouts_training$dayoftheweek,c(1,2,3,4,5), width, "weekly")
df_weekly <- df_weekly[, -which(names(df_weekly) == "DayOfTheWeek1")]
df_monthly <- compute_rbf(df_store3_checkouts_training,df_store3_checkouts_training$dayofthemonth,c(seq(1,31)), width, "monthly")
df_merged <- merge(df_monthly, df_weekly[c("date", "DayOfTheWeek2","DayOfTheWeek3","DayOfTheWeek4","DayOfTheWeek5")], by = "date")
df_merged <- df_merged[, -which(names(df_merged) == "DayOfTheMonth1")]
df_store3_checkouts_training <- df_merged[, -which(names(df_merged) == "dayoftheweek")]
df_weekly <- compute_rbf(df_store1_checkins_training,df_store1_checkins_training$dayoftheweek,c(1,2,3,4,5), width, "weekly")
df_weekly <- df_weekly[, -which(names(df_weekly) == "DayOfTheWeek1")]
df_monthly <- compute_rbf(df_store1_checkins_training,df_store1_checkins_training$dayofthemonth,c(seq(1,31)), width, "monthly")
df_merged <- merge(df_monthly, df_weekly[c("date", "DayOfTheWeek2","DayOfTheWeek3","DayOfTheWeek4","DayOfTheWeek5")], by = "date")
df_merged <- df_merged[, -which(names(df_merged) == "DayOfTheMonth1")]
df_store1_checkins_training <- df_merged[, -which(names(df_merged) == "dayoftheweek")]
df_weekly <- compute_rbf(df_store2_checkins_training,df_store2_checkins_training$dayoftheweek,c(1,2,3,4,5), width, "weekly")
df_weekly <- df_weekly[, -which(names(df_weekly) == "DayOfTheWeek1")]
df_monthly <- compute_rbf(df_store2_checkins_training,df_store2_checkins_training$dayofthemonth,c(seq(1,31)), width, "monthly")
df_merged <- merge(df_monthly, df_weekly[c("date", "DayOfTheWeek2","DayOfTheWeek3","DayOfTheWeek4","DayOfTheWeek5")], by = "date")
df_merged <- df_merged[, -which(names(df_merged) == "DayOfTheMonth1")]
df_store2_checkins_training <- df_merged[, -which(names(df_merged) == "dayoftheweek")]
df_weekly <- compute_rbf(df_store3_checkins_training,df_store3_checkins_training$dayoftheweek,c(1,2,3,4,5), width, "weekly")
df_weekly <- df_weekly[, -which(names(df_weekly) == "DayOfTheWeek1")]
df_monthly <- compute_rbf(df_store3_checkins_training,df_store3_checkins_training$dayofthemonth,c(seq(1,31)), width, "monthly")
df_merged <- merge(df_monthly, df_weekly[c("date", "DayOfTheWeek2","DayOfTheWeek3","DayOfTheWeek4","DayOfTheWeek5")], by = "date")
df_merged <- df_merged[, -which(names(df_merged) == "DayOfTheMonth1")]
df_store3_checkins_training <- df_merged[, -which(names(df_merged) == "dayoftheweek")]
df_store1_checkouts_test$dayoftheweek <- weekdays(df_store1_checkouts_test$date)
df_store1_checkins_test$dayoftheweek <- weekdays(df_store1_checkins_test$date)
df_store2_checkouts_test$dayoftheweek <- weekdays(df_store2_checkouts_test$date)
df_store2_checkins_test$dayoftheweek <- weekdays(df_store2_checkins_test$date)
df_store3_checkouts_test$dayoftheweek <- weekdays(df_store3_checkouts_test$date)
df_store3_checkins_test$dayoftheweek <- weekdays(df_store3_checkins_test$date)
df_store1_checkouts_test$dayoftheweek <- as.factor(match(df_store1_checkouts_test$dayoftheweek, weekdays(df_store1_checkouts_test$date)))
df_store2_checkouts_test$dayoftheweek <- as.factor(match(df_store2_checkouts_test$dayoftheweek, weekdays(df_store2_checkouts_test$date)))
df_store3_checkouts_test$dayoftheweek <- as.factor(match(df_store3_checkouts_test$dayoftheweek, weekdays(df_store3_checkouts_training$date)))
df_store1_checkins_test$dayoftheweek <- as.factor(match(df_store1_checkins_test$dayoftheweek, weekdays(df_store1_checkins_test$date)))
df_store2_checkins_test$dayoftheweek <- as.factor(match(df_store2_checkins_test$dayoftheweek, weekdays(df_store2_checkins_test$date)))
df_store3_checkins_test$dayoftheweek <- as.factor(match(df_store3_checkins_test$dayoftheweek, weekdays(df_store3_checkins_test$date)))
df_store1_checkouts_test$dayofthemonth <- as.factor(day(df_store1_checkouts_test$date))
df_store2_checkouts_test$dayofthemonth <- as.factor(day(df_store2_checkouts_test$date))
df_store3_checkouts_test$dayofthemonth <- as.factor(day(df_store3_checkouts_test$date))
df_store1_checkins_test$dayofthemonth <- as.factor(day(df_store1_checkins_test$date))
df_store2_checkins_test$dayofthemonth <- as.factor(day(df_store2_checkins_test$date))
df_store3_checkins_test$dayofthemonth <- as.factor(day(df_store3_checkins_test$date))
df_store1_checkouts_test$month <- month(df_store1_checkouts_test$date)
df_store2_checkouts_test$month <- month(df_store2_checkouts_test$date)
df_store3_checkouts_test$month <- month(df_store3_checkouts_test$date)
df_store1_checkins_test$month <- month(df_store1_checkins_test$date)
df_store2_checkins_test$month <- month(df_store2_checkins_test$date)
df_store3_checkins_test$month <- month(df_store3_checkins_test$date)
df_weekly <- compute_rbf(df_store1_checkouts_test,df_store1_checkouts_test$dayoftheweek,c(1,2,3,4,5), width, "weekly")
df_weekly <- df_weekly[, -which(names(df_weekly) == "DayOfTheWeek1")]
df_monthly <- compute_rbf(df_store1_checkouts_test,df_store1_checkouts_test$dayofthemonth,c(seq(1,31)), width, "monthly")
df_merged <- merge(df_monthly, df_weekly[c("date", "DayOfTheWeek2","DayOfTheWeek3","DayOfTheWeek4","DayOfTheWeek5")], by = "date")
df_merged <- df_merged[, -which(names(df_merged) == "DayOfTheMonth1")]
df_store1_checkouts_test <- df_merged[, -which(names(df_merged) == "dayoftheweek")]
df_weekly <- compute_rbf(df_store2_checkouts_test,df_store2_checkouts_test$dayoftheweek,c(1,2,3,4,5), width, "weekly")
df_weekly <- df_weekly[, -which(names(df_weekly) == "DayOfTheWeek1")]
df_monthly <- compute_rbf(df_store2_checkouts_test,df_store2_checkouts_test$dayofthemonth,c(seq(1,31)), width, "monthly")
df_merged <- merge(df_monthly, df_weekly[c("date", "DayOfTheWeek2","DayOfTheWeek3","DayOfTheWeek4","DayOfTheWeek5")], by = "date")
df_merged <- df_merged[, -which(names(df_merged) == "DayOfTheMonth1")]
df_store2_checkouts_test <- df_merged[, -which(names(df_merged) == "dayoftheweek")]
df_weekly <- compute_rbf(df_store3_checkouts_test,df_store3_checkouts_test$dayoftheweek,c(1,2,3,4,5), width, "weekly")
df_weekly <- df_weekly[, -which(names(df_weekly) == "DayOfTheWeek1")]
df_monthly <- compute_rbf(df_store3_checkouts_test,df_store3_checkouts_test$dayofthemonth,c(seq(1,31)), width, "monthly")
df_merged <- merge(df_monthly, df_weekly[c("date", "DayOfTheWeek2","DayOfTheWeek3","DayOfTheWeek4","DayOfTheWeek5")], by = "date")
df_merged <- df_merged[, -which(names(df_merged) == "DayOfTheMonth1")]
df_store3_checkouts_test <- df_merged[, -which(names(df_merged) == "dayoftheweek")]
df_weekly <- compute_rbf(df_store1_checkins_test,df_store1_checkins_test$dayoftheweek,c(1,2,3,4,5), width, "weekly")
df_weekly <- df_weekly[, -which(names(df_weekly) == "DayOfTheWeek1")]
df_monthly <- compute_rbf(df_store1_checkins_test,df_store1_checkins_test$dayofthemonth,c(seq(1,31)), width, "monthly")
df_merged <- merge(df_monthly, df_weekly[c("date", "DayOfTheWeek2","DayOfTheWeek3","DayOfTheWeek4","DayOfTheWeek5")], by = "date")
df_merged <- df_merged[, -which(names(df_merged) == "DayOfTheMonth1")]
df_store1_checkins_test <- df_merged[, -which(names(df_merged) == "dayoftheweek")]
df_weekly <- compute_rbf(df_store2_checkins_test,df_store2_checkins_test$dayoftheweek,c(1,2,3,4,5), width, "weekly")
df_weekly <- df_weekly[, -which(names(df_weekly) == "DayOfTheWeek1")]
df_monthly <- compute_rbf(df_store2_checkins_test,df_store2_checkins_test$dayofthemonth,c(seq(1,31)), width, "monthly")
df_merged <- merge(df_monthly, df_weekly[c("date", "DayOfTheWeek2","DayOfTheWeek3","DayOfTheWeek4","DayOfTheWeek5")], by = "date")
df_merged <- df_merged[, -which(names(df_merged) == "DayOfTheMonth1")]
df_store2_checkins_test <- df_merged[, -which(names(df_merged) == "dayoftheweek")]
df_weekly <- compute_rbf(df_store3_checkins_test,df_store3_checkins_test$dayoftheweek,c(1,2,3,4,5), width, "weekly")
df_weekly <- df_weekly[, -which(names(df_weekly) == "DayOfTheWeek1")]
df_monthly <- compute_rbf(df_store3_checkins_test,df_store3_checkins_test$dayofthemonth,c(seq(1,31)), width, "monthly")
df_merged <- merge(df_monthly, df_weekly[c("date", "DayOfTheWeek2","DayOfTheWeek3","DayOfTheWeek4","DayOfTheWeek5")], by = "date")
df_merged <- df_merged[, -which(names(df_merged) == "DayOfTheMonth1")]
df_store3_checkins_test <- df_merged[, -which(names(df_merged) == "dayoftheweek")]
I use the root mean squared error (RMSE) to evaluate each forecast.
model_store1_checkouts <- lm(checkouts ~.,data = df_store1_checkouts_training)
model_store2_checkouts <- lm(checkouts ~.,data = df_store2_checkouts_training)
model_store3_checkouts <- lm(checkouts ~.,data = df_store3_checkouts_training)
predictions_store1_checkouts <- predict(model, newdata = df_store1_checkouts_test)
predictions_store2_checkouts <- predict(model, newdata = df_store2_checkouts_test)
predictions_store2_checkouts <- predict(model, newdata = df_store3_checkouts_test)
model_store1_checkins <- lm(checkins ~.,data = df_store1_checkins_training)
model_store2_checkins <- lm(checkins ~.,data = df_store2_checkins_training)
model_store3_checkins <- lm(checkins ~.,data = df_store3_checkins_training)
predictions_store1_checkins <- predict(model, newdata = df_store1_checkins_test)
predictions_store2_checkins <- predict(model, newdata = df_store2_checkins_test)
predictions_store3_checkins <- predict(model, newdata = df_store3_checkins_test)
rmse_store1_ols_checkouts <- rmse(predictions_store1_checkouts, df_store1_checkouts_test$checkouts)
rmse_store2_ols_checkouts <- rmse(predictions_store2_checkouts, df_store2_checkouts_test$checkouts)
rmse_store3_ols_checkouts <- rmse(predictions_store2_checkouts, df_store3_checkouts_test$checkouts)
rmse_store1_ols_checkins<- rmse(predictions_store1_checkins, df_store1_checkins_test$checkins)
rmse_store2_ols_checkins <- rmse(predictions_store2_checkins, df_store2_checkins_test$checkins)
rmse_store3_ols_checkins <- rmse(predictions_store3_checkins, df_store3_checkins_test$checkins)
cat("RMSE OLS Store1 Check-outs: ",round(rmse_store1_ols_checkouts,2))
## RMSE OLS Store1 Check-outs: 73.38
cat("RMSE OLS Store2 Check-outs: ",round(rmse_store2_ols_checkouts,2))
## RMSE OLS Store2 Check-outs: 119.46
cat("RMSE OLS Store3 Check-outs: ",round(rmse_store3_ols_checkouts,2))
## RMSE OLS Store3 Check-outs: 149.75
cat("RMSE OLS Store1 Check-outs: ",round(rmse_store1_ols_checkins,2))
## RMSE OLS Store1 Check-outs: 69.65
cat("RMSE OLS Store2 Check-outs: ",round(rmse_store2_ols_checkins,2))
## RMSE OLS Store2 Check-outs: 129.89
cat("RMSE OLS Store3 Check-outs: ",round(rmse_store3_ols_checkins,2))
## RMSE OLS Store3 Check-outs: 147.09
To fit the ARIMA model, I utilize the stepwise algorithm (Hyndman & Khandakar, 2008), which is implemented in the forecast package. This algorithm is employed to determine the optimal parameters for both the seasonal and non-seasonal components of the ARIMA model. The ‘auto_arima()’ function is utilized to identify the best ARIMA model for the data, based on the Akaike Information Criterion (AIC). Additionally, it performs a Dickey-Fuller test to assess the need for differencing.
This process helps one select the most suitable ARIMA model by considering various factors, such as model complexity and goodness-of-fit measures
arima_model_store1_checkouts <- auto.arima(df_store1_checkouts_training$checkouts, seasonal = TRUE, test="adf", stepwise=TRUE)
arima_model_store2_checkouts <- auto.arima(df_store2_checkouts_training$checkouts, seasonal = TRUE, test="adf", stepwise=TRUE)
arima_model_store3_checkouts <- auto.arima(df_store3_checkouts_training$checkouts, seasonal = TRUE, test="adf", stepwise=TRUE)
store1_predicted_arima_checkouts <- forecast(arima_model_store1_checkouts, h = 10)
store2_predicted_arima_checkouts <- forecast(arima_model_store2_checkouts, h = 10)
store3_predicted_arima_checkouts <- forecast(arima_model_store3_checkouts, h = 10)
arima_model_store1_checkins <- auto.arima(df_store1_checkins_training$checkins, seasonal = TRUE, test="adf", stepwise=TRUE)
arima_model_store2_checkins <- auto.arima(df_store2_checkins_training$checkins, seasonal = TRUE, test="adf", stepwise=TRUE)
arima_model_store3_checkins <- auto.arima(df_store3_checkins_training$checkins, seasonal = TRUE, test="adf", stepwise=TRUE)
store1_predicted_arima_checkins <- forecast(arima_model_store1_checkins, h = 10)
store2_predicted_arima_checkins <- forecast(arima_model_store2_checkins, h = 10)
store3_predicted_arima_checkins <- forecast(arima_model_store3_checkins, h = 10)
rmse_store1_arima_checkouts <- rmse(store1_predicted_arima_checkouts$mean, df_store1_checkouts_test$checkouts)
rmse_store2_arima_checkouts <- rmse(store2_predicted_arima_checkouts$mean, df_store2_checkouts_test$checkouts)
rmse_store3_arima_checkouts <- rmse(store3_predicted_arima_checkouts$mean, df_store3_checkouts_test$checkouts)
rmse_store1_arima_checkins <- rmse(store1_predicted_arima_checkins$mean, df_store1_checkins_test$checkins)
rmse_store2_arima_checkins <- rmse(store2_predicted_arima_checkins$mean, df_store2_checkins_test$checkins)
rmse_store3_arima_checkins <- rmse(store3_predicted_arima_checkins$mean, df_store3_checkins_test$checkins)
cat("RMSE SARIMA Store1 Check-outs: ",round(rmse_store1_arima_checkouts,2))
## RMSE SARIMA Store1 Check-outs: 34.81
cat("RMSE SARIMA Store2 Check-outs: ",round(rmse_store2_arima_checkouts,2))
## RMSE SARIMA Store2 Check-outs: 28.92
cat("RMSE SARIMA Store3 Check-outs: ",round(rmse_store3_arima_checkouts,2))
## RMSE SARIMA Store3 Check-outs: 37.41
cat("RMSE SARIMA Store1 Check-ins: ",round(rmse_store1_arima_checkins,2))
## RMSE SARIMA Store1 Check-ins: 29.62
cat("RMSE SARIMA Store2 Check-ins: ",round(rmse_store2_arima_checkins,2))
## RMSE SARIMA Store2 Check-ins: 32.08
cat("RMSE SARIMA Store3 Check-ins: ",round(rmse_store3_arima_checkins,2))
## RMSE SARIMA Store3 Check-ins: 28.09
For the hyperparameter tuning of the random forest algorithm, I conducted several experiments primarily focused on adjusting the number of trees and the number of variables to randomly sample as candidates at each split. I will omit the details of those experiments in this response.
store1_checkouts <- df_store1_checkouts_training$checkouts
store2_checkouts <- df_store2_checkouts_training$checkouts
store3_checkouts <- df_store3_checkouts_training$checkouts
store1_checkins <- df_store1_checkins_training$checkins
store2_checkins <- df_store2_checkins_training$checkins
store3_checkins <- df_store3_checkins_training$checkins
# Exclude the dependent variable (checkins & checkouts) & date, and the weekend column from the exogenous variables dataframe
store1_checkouts_features <- df_store1_checkouts_training[, -(1:3)]
store2_checkouts_features <- df_store2_checkouts_training[, -(1:3)]
store3_checkouts_features <- df_store3_checkouts_training[, -(1:3)]
store1_checkins_features <- df_store1_checkins_training[, -(1:3)]
store2_checkins_features <- df_store2_checkins_training[, -(1:3)]
store3_checkins_features <- df_store3_checkins_training[, -(1:3)]
# Convert exogenous variables dataframe to a numeric matrix
store1_checkouts_matrix_features <- as.matrix(store1_checkouts_features)
store2_checkouts_matrix_features <- as.matrix(store2_checkouts_features)
store3_checkouts_matrix_features <- as.matrix(store3_checkouts_features)
store1_checkins_matrix_features <- as.matrix(store1_checkins_features)
store2_checkins_matrix_features <- as.matrix(store2_checkins_features)
store3_checkins_matrix_features <- as.matrix(store3_checkins_features)
# Fit the random forest models
rf_store1_checkouts <- randomForest(x = store1_checkouts_matrix_features, y = store1_checkouts, ntree = 1000, mtry = 6)
rf_store2_checkouts <- randomForest(x = store2_checkouts_matrix_features, y = store2_checkouts, ntree = 1000, mtry = 6)
rf_store3_checkouts <- randomForest(x = store3_checkouts_matrix_features, y = store3_checkouts, ntree = 1000, mtry = 6)
rf_store1_checkins <- randomForest(x = store1_checkins_matrix_features, y = store1_checkins, ntree = 1000, mtry = 6)
rf_store2_checkins <- randomForest(x = store2_checkins_matrix_features, y = store2_checkins, ntree = 1000, mtry = 6)
rf_store3_checkins <- randomForest(x = store3_checkins_matrix_features, y = store3_checkins, ntree = 1000, mtry = 6)
# predictions
store1_test_checkouts <- df_store1_checkouts_test$checkouts
store2_test_checkouts <- df_store2_checkouts_test$checkouts
store3_test_checkouts <- df_store3_checkouts_test$checkouts
store1_test_checkins <- df_store1_checkins_test$checkins
store2_test_checkins <- df_store2_checkins_test$checkins
store3_test_checkins <- df_store3_checkins_test$checkins
# Exclude the dependent variable from the exogenous variables dataframe
store1_checkouts_test_features <- df_store1_checkouts_test[, -2]
store2_checkouts_test_features <- df_store2_checkouts_test[, -2]
store3_checkouts_test_features <- df_store3_checkouts_test[, -2]
store1_checkins_test_features <- df_store1_checkins_test[, -2]
store2_checkins_test_features <- df_store2_checkins_test[, -2]
store3_checkins_test_features <- df_store3_checkins_test[, -2]
predictions_store1_checkouts <- predict(rf_store1_checkouts, newdata = store1_checkouts_test_features)
predictions_store2_checkouts <- predict(rf_store2_checkouts, newdata = store2_checkouts_test_features)
predictions_store3_checkouts <- predict(rf_store3_checkouts, newdata = store3_checkouts_test_features)
predictions_store1_checkins <- predict(rf_store1_checkins, newdata = store1_checkins_test_features)
predictions_store2_checkins <- predict(rf_store2_checkins, newdata = store2_checkins_test_features)
predictions_store3_checkins <- predict(rf_store3_checkins, newdata = store3_checkins_test_features)
rmse_store1_rf_checkouts <- rmse(predictions_store1_checkouts, store1_test_checkouts)
rmse_store2_rf_checkouts <- rmse(predictions_store2_checkouts, store2_test_checkouts)
rmse_store3_rf_checkouts <- rmse(predictions_store3_checkouts, store3_test_checkouts)
rmse_store1_rf_checkins <- rmse(predictions_store1_checkins, store1_test_checkins)
rmse_store2_rf_checkins <- rmse(predictions_store2_checkins, store2_test_checkins)
rmse_store3_rf_checkins <- rmse(predictions_store3_checkins, store3_test_checkins)
cat("RMSE Random Forest Store1 Check-outs: ",round(rmse_store1_rf_checkouts,2))
## RMSE Random Forest Store1 Check-outs: 87.93
cat("RMSE Random Forest Store2 Check-outs: ",round(rmse_store2_rf_checkouts,2))
## RMSE Random Forest Store2 Check-outs: 83.2
cat("RMSE Random Forest Store3 Check-outs: ",round(rmse_store3_rf_checkouts,2))
## RMSE Random Forest Store3 Check-outs: 78.42
cat("RMSE Random Forest Store1 Check-ins: ",round(rmse_store1_rf_checkins,2))
## RMSE Random Forest Store1 Check-ins: 72.82
cat("RMSE Random Forest Store2 Check-ins: ",round(rmse_store2_rf_checkins,2))
## RMSE Random Forest Store2 Check-ins: 90.64
cat("RMSE Random Forest Store3 Check-ins: ",round(rmse_store3_rf_checkins,2))
## RMSE Random Forest Store3 Check-ins: 81.29
The table below presents the results, indicating that the OLS model performs significantly worse compared to the SARIMA and random forest models for Store2 and Store3. The random forest performs better than linear regression, which suggests that this machine learning algorithm captures non-linear patterns within the data for Store2 and Store3. However, the SARIMA model effectively utilizes the autoregressive structure present in the data, demonstrating its impressive capability. This once again highlights the power of leveraging the autoregressive properties of a time series. It is worth noting that I intentionally excluded the creation of lagged variables in this analysis to highlight this.
| Algortihm | Store 1 | Store 2 | Store 3 |
|---|---|---|---|
| OLS | 73.38 | 119.46 | 149.75 |
| SARIMA | 34.81 | 28.92 | 37.41 |
| Random Forest | 87.93 | 83.20 | 78.42 |
| Algortihm | Store 1 | Store 2 | Store 3 |
|---|---|---|---|
| OLS | 69.65 | 129.89 | 147.09 |
| SARIMA | 29.62 | 32.08 | 28.09 |
| Random Forest | 72.82 | 90.64 | 81.29 |
Bandara, K., Hyndman, R.J., & Bergmeir, C. (2021). MSTL: A Seasonal-Trend Decomposition Algorithm for Time Series with Multiple Seasonal Patterns. arXiv preprint 2107.13462. https://arxiv.org/abs/2107.13462.
Hyndman, R.J., & Khandakar, Y. (2008). Automatic time series forecasting: The forecast package for R, Journal of Statistical Software, 26(3), 1–22. https://doi.org/10.18637/jss.v027.i03.
Huber, J., & Stuckenschmidt, H. (2020). Daily retail demand forecasting using machine learning with emphasis on calendric special days. International Journal of Forecasting, 36(4), 1420-1438, https://doi.org/10.1016/j.ijforecast.2020.02.005.
Sheta, A.F., De Jong, K. (2001). Time-series forecasting using GA-tuned radial basis functions, Information Sciences, 133(3-4) ,221-228. https://doi.org/10.1016/S0020-0255(01)00086-X.